hard example
CS-Isolate: Extracting Hard Confident Examples by Content and Style Isolation
Label noise widely exists in large-scale image datasets. To mitigate the side effects of label noise, state-of-the-art methods focus on selecting confident examples by leveraging semi-supervised learning. Existing research shows that the ability to extract hard confident examples, which are close to the decision boundary, significantly influences the generalization ability of the learned classifier. In this paper, we find that a key reason some hard examples are close to the decision boundary is the entanglement of style factors with content factors. The hard examples become more discriminative when we focus solely on content factors, such as semantic information, while ignoring style factors. Nonetheless, given only noisy data, content factors are not directly observed and have to be inferred. To infer content factors for classification when learning with noisy labels, our objective is to ensure that the content factors of all examples in the same underlying clean class remain unchanged as their style information changes. To achieve this, we utilize different data augmentation techniques to alter the styles while regularizing content factors based on some confident examples. By training existing methods with our inferred content factors, we demonstrate the effectiveness of CS-Isolate in learning hard examples on benchmark datasets. The implementation is available at https://github.com/tmllab/2023
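The content-invariance objective the abstract describes can be sketched as a consistency penalty between two style-altering augmentations of the same image (a toy illustration under our own naming, not the authors' implementation):

```python
# Minimal sketch of the content-invariance idea: content factors inferred
# from two differently augmented views of the same image should match.
# In a real pipeline, content_a and content_b would come from an encoder
# applied to augmented views; here they are plain lists of floats.
def content_consistency_loss(content_a, content_b):
    # Squared L2 distance between the two sets of inferred content
    # factors; minimizing it pushes content to be invariant to style.
    return sum((a - b) ** 2 for a, b in zip(content_a, content_b))
```

In training, this term would be added to the classification loss so that changing style (via augmentation) cannot move an example's content representation.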
Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets
Pikus, Benjamin, Tiwari, Pratyush Ranjan, Ye, Burton
Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate whether example difficulty affects GRPO training effectiveness by comparing selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks. Training on the hardest 10% of examples (those where the base model fails most often) yields dramatic performance gains of up to 47%, while easy examples produce minimal improvements of 3-15%. This occurs because GRPO requires outcome variance to generate learning signals; hard examples maintain mixed success/failure outcomes throughout training, while easy examples quickly converge to consistent success, eliminating learning opportunities. Moreover, models trained on hard examples show superior out-of-distribution generalization, with only hard-trained models achieving meaningful gains on the AIME2025 benchmark. Our findings provide clear guidance: when budget-constrained, prioritize collecting and annotating examples where your base model struggles, as these drive nearly all learning value in GRPO fine-tuning.
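The variance mechanism the abstract describes can be sketched in a few lines (a toy illustration with our own function names, not the authors' training code): GRPO normalizes rewards within a group of sampled completions, so a group with identical outcomes yields zero advantage for every rollout.

```python
# Sketch of group-relative advantages as used in GRPO-style training:
#   A_i = (r_i - mean(r)) / std(r)
# If every rollout in the group succeeds (or fails), all advantages are
# zero and the example contributes no gradient signal.
def group_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

easy = group_advantages([1, 1, 1, 1])  # converged easy example -> all zeros
hard = group_advantages([1, 0, 1, 0])  # mixed outcomes -> nonzero signal
```

This is why, once an easy example is solved consistently, it stops contributing to learning, while a hard example with mixed outcomes keeps producing signal.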
A Appendix
Hyperparameters:
- Number of encoder (decoder) layers: 6
- Number of layers in the feed-forward network: 2
- Number of hidden units in the feed-forward network: 128
- Mask filter size: 3
- Mask number of filters: 16
- Ratio of residual connection: 1.5
- Dropout rate: 0.1
- Optimizer: Adam
- Warm-up steps: 4000
- Learning rate: d^(-0.5) * min(t^(-0.5), t * 4000^(-1.5)), where d is the hidden dimension and t is the training step

Unless otherwise specified, the task performed in this section is selection sort (Section 4). Figure 6 shows the sorting performance of the transformers w/o mask supervision. Figure 7 shows sorting performances with different encoding schemes. In Figure 9, we show the strong generalization performance of the different architectures. While some changes are able to improve performance in this regime, the performance ultimately drops steeply as the length of the test sequence increases. The symbol e represents the end token.
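The warm-up schedule in the table is the standard Transformer ("Noam") schedule; a minimal sketch, assuming d = 128 hidden units and the 4000 warm-up steps listed above:

```python
# Noam learning-rate schedule: linear warm-up for the first `warmup`
# steps, then inverse-square-root decay.
def noam_lr(step, d_model=128, warmup=4000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches meet exactly at step = warmup, where the schedule peaks.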
MarginSel : Max-Margin Demonstration Selection for LLMs
Ambati, Rajeev Bhatt, Lester, James, Srivastava, Shashank, Chaturvedi, Snigdha
Large Language Models (LLMs) excel at few-shot learning via in-context learning (ICL). However, the effectiveness of ICL is often sensitive to the selection and ordering of demonstration examples. To address this, we present MarginSel: Max-Margin Demonstration Selection for LLMs, a two-step method that selects hard demonstration examples for the ICL prompt, adapting to each test instance. Our approach achieves 2-7% absolute improvement in F1-score across classification tasks, compared to a random selection of examples. We also provide theoretical insights and empirical evidence showing that MarginSel induces max-margin behavior in LLMs by effectively increasing the margin for hard examples, analogous to support vectors, thereby shifting the decision boundary in a beneficial direction.
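As an illustration of margin-based hardness (a sketch under our own naming, not the MarginSel implementation), demonstrations can be ranked by the gap between the top two class probabilities from an initial model pass, with the smallest-margin examples treated as hard:

```python
# Margin of a probability distribution: gap between the two most
# probable classes. Small margin = close to the decision boundary.
def margin(probs):
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

def select_hard(examples, k):
    # examples: list of (id, class-probability list); keep the k
    # examples with the smallest margins as "hard" demonstrations.
    return sorted(examples, key=lambda e: margin(e[1]))[:k]
```

In the paper's analogy, such small-margin examples play the role of support vectors for the ICL prompt.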
- North America > United States > Washington > King County > Seattle (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.34)
Review for NeurIPS paper: SuperLoss: A Generic Loss for Robust Curriculum Learning
Additional Feedback: Further comments: - The definition of hard and easy examples is limited to their respective confidence scores or losses. Although previous work uses similar definitions, confidence or loss is not always a good indicator of the true easiness or hardness of a sample; for example, these signals can be erroneous at early iterations. The paper lacks an experiment that validates this definition. Examples the model treats as easy are probably, in part, hard or noisy examples that were mistaken for easy ones; likewise, examples the model treats as hard are probably a mixture of easy, hard, and noisy examples with low confidence across the loss spectrum.
What can Large Language Models Capture about Code Functional Equivalence?
Maveli, Nickil, Vergari, Antonio, Cohen, Shay B.
Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using them to generate or classify code fragments. At the same time, whether they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing SeqCoBench, a benchmark for systematically assessing how Code-LLMs can capture code functional equivalence. SeqCoBench contains over 20 code transformations that either preserve or alter the semantics of Python programs. We conduct extensive evaluations in different settings, including zero-shot and parameter-efficient finetuning methods on state-of-the-art (Code-)LLMs, to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench. We find that the performance gap between these LLMs and classical match-based retrieval scores is minimal, with both approaches showing a concerning lack of depth in understanding code semantics.
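To make the notion of semantics-preserving versus semantics-altering transformations concrete, here is an illustrative pair in the spirit of SeqCoBench's setup (our own example, not one of the benchmark's actual transformations):

```python
# A reference Python snippet and two transformed variants.
def original(xs):
    total = 0
    for x in xs:
        total += x
    return total

def renamed(values):
    # Semantics-preserving transformation: variable renaming only.
    acc = 0
    for v in values:
        acc += v
    return acc

def altered(xs):
    # Semantics-altering transformation: a single operator change.
    total = 0
    for x in xs:
        total -= x
    return total
```

A model that truly captures functional equivalence should pair `original` with `renamed` but not with `altered`, even though all three are near-identical as token sequences.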
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Singapore (0.04)
- (7 more...)
Maximizing V-information for Pre-training Superior Foundation Models
Yang, Wenxuan, Tan, Weimin, Zhang, Hanyu, Yan, Bo
Pre-training foundation models on large-scale datasets demonstrates exceptional performance. However, recent research questions this traditional notion, exploring whether an increase in pre-training data always leads to enhanced model performance. To address this issue, data-effective learning approaches have been introduced, but current methods in this area lack a clear standard for sample selection. Our experiments reveal that by maximizing V-information, sample selection can be framed as an optimization problem, enabling effective improvement in model performance even with fewer samples. Under this guidance, we develop an optimal data-effective learning method (OptiDEL) to maximize V-information. The OptiDEL method generates hard samples to achieve or even exceed the performance of models trained on the full dataset while using substantially less data. Comparing OptiDEL with state-of-the-art approaches, we find that it consistently outperforms existing methods across different datasets, with foundation models trained on only 5% of the pre-training data surpassing those trained on the full dataset.
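One way to make a V-information criterion concrete is via pointwise V-information (PVI), following Ethayarajh et al.'s V-usable information; treating low-PVI samples as "hard" is our assumption about how such a selector could work, not OptiDEL's actual rule:

```python
import math

# Pointwise V-information: how much easier the label becomes to predict
# once the input is observed, relative to an input-free baseline.
def pvi(log_p_null, log_p_model):
    # log_p_null: log-prob of the label under a model given no input
    # log_p_model: log-prob of the label given the input
    return log_p_model - log_p_null

def select_hard(samples, k):
    # samples: list of (id, log_p_null, log_p_model).
    # Lowest PVI = input helps least = hardest sample.
    return sorted(samples, key=lambda s: pvi(s[1], s[2]))[:k]
```

Negative PVI means the input actively misleads the model about the label, which is exactly the regime where hard-sample selection pays off.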
- Europe > Austria (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training
Verma, Bhuvanesh, Raithel, Lisa
Natural language processing (NLP) has seen significant advancements, beginning with the introduction of word embeddings (Mikolov et al., 2013), followed by transformer architectures like BERT (Vaswani et al., 2017; Devlin et al., 2019), and specialized language models (LMs) such as BioBERT (Lee et al., 2020) and PubMedBERT (Gu et al., 2021) tailored for the biomedical domain. The advent of large language models (LLMs) like GPT-3 (Brown et al., 2020), commonly known as ChatGPT, has further pushed the boundaries of NLP, showcasing capabilities in diverse NLP tasks and even reasoning.

Building on the methodology outlined by Kanakarajan and Sankarasubbu (2023), we assessed the zero-shot performance of various instruction-tuned LLMs to identify the most effective model. Upon selecting the best LLM, we introduced an auxiliary module during the fine-tuning process, which emphasized learning "hard" examples. Taking inspiration from Korakakis and Vlachos (2023), who experimented with various configurations for the auxiliary module and highlighted its substantial impact on the final NLI system's performance, we explored various architectures for the auxiliary module.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore (0.04)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- (4 more...)
FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models
Li, Wei, Ma, Ren, Wu, Jiang, Gu, Chenya, Peng, Jiahui, Len, Jinyang, Zhang, Songyang, Yan, Hang, Lin, Dahua, He, Conghui
In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choice questions across common sense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.
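A CircularEval-style check can be sketched as follows (our reading of the protocol: rotate the answer options and count a question correct only if the model answers correctly under every rotation; function names are ours):

```python
# Circular evaluation of a multiple-choice question: the model must
# locate the correct option under every rotation of the option list,
# which neutralizes positional bias (e.g. always answering "A").
def circular_eval(options, answer_idx, model):
    # options: list of option strings; answer_idx: index of the correct
    # option; model: callable mapping an option list to a chosen index.
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        correct_pos = (answer_idx - shift) % n
        if model(rotated) != correct_pos:
            return False
    return True
```

A position-biased model that always picks the first option passes at most one rotation, so its accuracy collapses under this protocol while a genuinely knowledgeable model is unaffected.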